Abstract

Given the large number of new musical tracks released each year, automated approaches to plagiarism detection are essential to help us track potential violations of copyright. Most current approaches to plagiarism detection are based on musical similarity measures, which typically ignore the issue of polyphony in music. We present a novel feature space for audio derived from compositional modelling techniques, commonly used in signal separation, that provides a mechanism to account for polyphony without incurring an inordinate amount of computational overhead. We employ this feature representation in conjunction with traditional audio feature representations in a classification framework which uses an ensemble of distance features to characterize pairs of songs as being plagiarized or not. Our experiments on a database of about 3000 musical track pairs show that the new feature space characterization produces significant improvements over standard baselines.

Index Terms: music plagiarism detection, polyphonic music, similarity measures, compositional models, monaural signal separation

1. Introduction 

Music plagiarism is the use or close imitation of another author's music without proper acknowledgement. Every year, vast numbers of new music tracks are released globally, and questionable similarities exist in some sections of music tracks. Aided by the internet, plagiarism is now noticeable globally, not just across authors, but also across languages and countries. In 2008 alone, 1.4 billion music tracks were sold internationally. This number has since increased to over 1.8 billion. In 2004, the SACEM, an organization that seeks to protect the rights of the original authors, composers and publishers, was able to manually check only a small percentage of registered pieces for potential copyright violations. With such vast numbers of tracks to monitor, the need for automatic techniques for identifying potential copyright violations and detecting music plagiarism is clear.

Current approaches to plagiarism detection use techniques based on musical similarity analysis, which emphasizes finding musical pieces in large databases for retrieval. Various feature sets have been proposed for the characterization of musical tracks, including pitch contours, loudness, and cepstral features. Methods for computing similarity, given the characterizations of two musical tracks, include n-gram based similarity, geometric distances, and string-matching algorithms. There are two main issues with adopting similar techniques for plagiarism detection. First, these approaches typically ignore the issue of polyphony in the recordings, simply using a monophonic approximation instead [8]. Polyphonic music can have multiple overlapping notes and is far more complex from an analysis perspective than monophony. Feature extraction techniques traditionally used for monophonic music cannot be expected to do a good job of representing polyphonic characteristics. As a result, these methods have limited success when applied to polyphonic music. Second, unlike similarity computation for retrieval, where a system returns a ranked list, a plagiarism detector needs to decide whether a pair of songs is sufficiently similar that one may have been plagiarized from the other.

In this paper, we present an approach that can effectively deal with these issues. To tackle the problem of polyphony, we present a novel feature set derived from signal separation based on compositional models. This feature set represents the magnitude spectrum of each frame in a musical data segment as an additive, weighted combination of a set of bases. The bases are not constrained to be physically meaningful in this work, but such constraints may be applied as well; e.g. each basis could represent one of the notes that are expected to be present. In this framework, the weights assigned to the bases to compose each spectral vector can be used as a feature representation for the audio frame.

The second problem of requiring a decision, as opposed to a ranked list in the case of retrieval, is tackled by formulating the problem as a discriminative classification task. Given a pair of musical segments and various feature characterizations, we compute an ensemble of distance-based features. These computed distances serve as a representation of the pair of musical pieces. Each pair is a datapoint with a corresponding label that indicates whether one of the songs is plagiarized from the other or not. We use the labels in conjunction with the distance-based features to train a classification model. Now, given a pair of test musical segments, the system can use this model to decide whether one of the segments may have been plagiarized from the other.

While applications aside from plagiarism detection are beyond the scope of this paper, the techniques described can easily be applied to other related tasks. For instance, for the task of retrieving similar segments, given a musical segment ($M_i$), one can query the classification model using $M_i$ and all other musical segments in the database. Such a system could use the distance from the decision boundary as a score to create a ranked similarity list. The use of signal separation-based embedding for musical segments can also be adopted for other music tasks in polyphonic settings.
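The retrieval variant mentioned above can be sketched as follows. This is only an illustration, not part of the system evaluated in this paper: `rank_similar`, `decision_score`, and `toy_score` are hypothetical names, and the toy scoring function stands in for the signed distance from a trained classifier's decision boundary.

```python
from typing import Callable, List, Sequence


def rank_similar(
    query: float,
    database: Sequence[float],
    decision_score: Callable[[float, float], float],
) -> List[float]:
    """Rank database segments by classifier score against the query, best first."""
    candidates = [seg for seg in database if seg != query]
    # Higher score = further on the "plagiarized" side of the decision boundary.
    return sorted(candidates, key=lambda seg: decision_score(query, seg), reverse=True)


# Toy stand-in for a trained model (assumption): segments are plain numbers,
# and closer numbers receive a higher score.
def toy_score(a: float, b: float) -> float:
    return -abs(a - b)


ranking = rank_similar(5.0, [1.0, 4.5, 5.0, 9.0, 5.2], toy_score)
```

In a real system the segments would be audio excerpts and `decision_score` would wrap the pairwise distance features and the trained model.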

The remainder of this paper is organized as follows: in Section 2 we present a detailed formulation of the problem of music plagiarism detection. Section 3 discusses the feature representations we use for music tracks, as well as those used to characterize pairs of musical segments as plagiarized or not. In Section 4 we describe the dataset used in our experiments and present and discuss our results. We conclude the paper in Section 5.

2. Problem Formulation 

In this paper, we approach the task of detecting music plagiarism in a pairwise discriminative classification framework: given a pair of musical segments, we wish to predict whether the pair is sufficiently similar to be considered plagiarized. Each musical track ($x$) is first represented using a set of feature vectors denoted by $F(x) = \{f_1, f_2, \ldots, f_n\}$. Each individual $f_k$ here represents a class of features, e.g. Mel-Frequency Cepstral Coefficient features or pitch contour features.

We then define a set of $n$ distance functions (one for each feature class extracted for a song), $\phi_1, \phi_2, \ldots, \phi_n$, over pairs of such tracks, $x_i$ and $x_j$, where each function $\phi_k$ computes the distance between the feature vectors of the two music tracks for the $k$-th feature class:

$$\phi_k(x_i, x_j) = D(f_k(x_i), f_k(x_j)), \quad \forall k \in [1, 2, \ldots, n] \qquad (1)$$

where $D$ represents a distance function that is computed over the $k$-th feature class of each of the music tracks. As we describe in Section 3.3, we use an edit-distance measure to compute distance for our task; we note, however, that any other distance metric could be used in this framework if applied to a different task. The set of distance scores thus computed behaves, in effect, as a feature characterization of the degree of difference between the two songs:

$$\mathcal{F}(x_i, x_j) = \{\phi_1(x_i, x_j), \phi_2(x_i, x_j), \ldots, \phi_n(x_i, x_j)\} \qquad (2)$$

At training time, given information about whether the pair of songs, $x_i$ and $x_j$, represents a positive instance of plagiarism, we can use the distance-based feature set in a supervised classification framework to train a model that can predict plagiarism, given a pair of music tracks. Let $w$ represent the set of weights for the features learnt at training time, and $Q$ represent the function that computes a score for the datapoint (e.g. $Q(w, \mathcal{F}(x_i, x_j)) = \sum_{k=1}^{n} w_k \mathcal{F}_k$ for linear regression, where $\mathcal{F}_k$ is the $k$-th distance feature). We can then obtain a label $L$ for the pair of music tracks using a thresholded classifier score as follows:

$$H(x_i, x_j) = Q(w, \mathcal{F}(x_i, x_j)) - p \qquad (3)$$

$$L = \begin{cases} +1, & \text{if } H(x_i, x_j) > 0, \\ -1, & \text{if } H(x_i, x_j) < 0. \end{cases} \qquad (4)$$

where $L = +1$ denotes plagiarized. Unlike tasks requiring similarity computation between tracks for search-like applications, our task of detecting plagiarism is different in that we cannot simply pick the few most similar songs as potentially plagiarized, since this would result in a large number of songs that need manual examination. The parameter $p$ is therefore used as a threshold in this formulation, and we try to find an optimal value for this parameter so that we make the fewest mistakes.
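The formulation in Equations (1) to (4) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the feature-class names, the placeholder distance `mean_abs_distance` (the system in this paper uses a DTW-based distance instead), the linear scoring function, and the example weights and threshold. Because the features are distances, the example weights are negative so that small distances give a high score.

```python
from typing import Dict, List, Sequence

# A track maps each feature-class name to its feature sequence (hypothetical layout).
Track = Dict[str, Sequence[float]]


def mean_abs_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Placeholder for D: mean absolute difference over the shared prefix."""
    n = min(len(a), len(b))
    return sum(abs(a[i] - b[i]) for i in range(n)) / n


def pair_features(x_i: Track, x_j: Track, classes: List[str]) -> List[float]:
    """Eq. (1)-(2): one distance phi_k per feature class k."""
    return [mean_abs_distance(x_i[k], x_j[k]) for k in classes]


def linear_score(w: Sequence[float], feats: Sequence[float]) -> float:
    """Example scoring function Q: weighted sum of the distance features."""
    return sum(wk * fk for wk, fk in zip(w, feats))


def label(w: Sequence[float], feats: Sequence[float], p: float) -> int:
    """Eq. (3)-(4): H = Q - p, thresholded at zero.

    H == 0 is not covered by Eq. (4); this sketch assigns it to -1.
    """
    h = linear_score(w, feats) - p
    return 1 if h > 0 else -1


classes = ["mfcc", "pitch"]
a = {"mfcc": [1.0, 2.0, 3.0], "pitch": [0.5, 0.5]}
b = {"mfcc": [1.0, 2.0, 4.0], "pitch": [0.5, 0.7]}   # close to a
c = {"mfcc": [5.0, 5.0, 5.0], "pitch": [2.0, 2.0]}   # far from a
w, p = [-1.0, -1.0], -0.5                            # illustrative values only

near_label = label(w, pair_features(a, b, classes), p)
far_label = label(w, pair_features(a, c, classes), p)
```

With these illustrative values the close pair falls on the positive side of the threshold and the distant pair on the negative side; in practice both $w$ and $p$ would be fitted on labelled training pairs.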

3. Feature Representations from Compositional Models

Our feature-space design is primarily motivated by the fact that most current approaches do not explicitly address the issue of polyphony in recordings, simply using a monophonic approximation instead, while others consider polyphony as a more general multidimensional mathematical issue. While an ideal solution to this problem would involve separating out the multiple notes or voices in the recording into different tracks, this would require information for each track that is not likely to be available.

We introduce a novel feature set that is based on a compositional representation of the magnitude spectra. Specifically, we use a non-negative matrix factorization (NMF)-based embedding for the music tracks [6], which we expect will account for polyphony much better than the feature sets traditionally used for music representation. These NMF-based features are used in conjunction with traditional feature representations. In the following, we first describe the NMF-based feature extraction technique for audio, and then briefly describe traditionally used feature sets in Section 3.2. In Section 3.3 we discuss the distance function used to create a characterization for pairs of tracks.


3.1. NMF features 

NMF is a subspace analysis technique which obtains a parts-based representation of data by imposing non-negativity constraints. Given training data, NMF can learn a set of basis vectors so that we can represent any datapoint as a non-negative weighted linear combination of these vectors. We use the magnitude spectra of the audio signals as our data, since they are guaranteed to be non-negative. We can represent $M_t$, a magnitude spectral vector at time $t$, as:

$$M_t = \sum_{i=1}^{N} b_i w_{i,t} \qquad (5)$$

where $b_i$ is the $i$-th basis vector, $w_{i,t}$ is the weight of the $i$-th basis in frame $t$, and $N$ is the number of basis vectors.

If we represent the set of basis vectors using the matrix $B_N = [b_1, \ldots, b_N]$, and the weights using the matrix $[W_N]_{i,t} = w_{i,t}$, we can write the model (dropping the subscript $N$) as:

$$M = BW \qquad (6)$$

NMF has been applied to various audio tasks, including blind source separation and separation of speech from music. The intuition behind using the NMF formulation for music is that a polyphonic music segment will be composed additively from various notes, and NMF can estimate the contribution of the various notes. Thus, each audio frame can be represented using the NMF technique in the $N$-dimensional basis space, where the weights for the frame correspond to the co-ordinates of the frame in this space. NMF has a significant advantage over dimensionality reduction techniques such as PCA and ICA in that the number of bases used need not be less than the dimensionality of the original space, which allows an overcomplete basis space. For this task, the basis vectors may be thought of as individual notes present in the music.

We use an exemplar-based basis set for our experiments in this paper, where bases are drawn randomly from a collection of spectral vectors for the source (magnitude spectral vectors in the music data, in our case). Such bases, although lacking an intuitive interpretation, have useful theoretical properties. For alternate tasks, where more information about the notes and instruments is available, one could constrain the basis set to consist of true notes or learn them from audio libraries.


Once the set of bases $B$ is selected, each magnitude spectral vector in the dataset can be represented as a non-negative weighted combination of the bases. The weights are obtained using an iterative update rule that minimizes a generalized Kullback-Leibler divergence between $M$ and $BW$, as follows:

$$W = W \otimes \frac{B^{\top} \left[ \frac{M}{BW} \right]}{B^{\top} \mathbf{1}} \qquad (7)$$

where $\mathbf{1}$ is a matrix of ones and the operation $\otimes$ denotes element-wise multiplication. All divisions are element-wise as well. The weights $W$ are initialized to unity, and we iterate Equation (7) to convergence. For each music piece, we now have a sequence of weight vectors $W$ which can be used as a feature representation for the audio in the basis space. These sequences correspond to one feature class, as described in Section 2. We used a 1024-dimensional representation of each audio frame in the data and 64 basis vectors for all experiments reported in this paper.
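The weight estimation of Equation (7) with a fixed basis set can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the implementation used for the experiments: the function names are hypothetical, a fixed iteration count replaces a convergence test, and a small `eps` is added to avoid division by zero.

```python
from typing import List

Matrix = List[List[float]]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A: Matrix) -> Matrix:
    return [list(r) for r in zip(*A)]


def nmf_weights(M: Matrix, B: Matrix, n_iter: int = 100, eps: float = 1e-12) -> Matrix:
    """Estimate W in M ~= BW for fixed bases B using the update of Eq. (7).

    M is F x T (frequency bins x frames), B is F x N, and the result is N x T.
    """
    F, T, N = len(M), len(M[0]), len(B[0])
    Bt = transpose(B)
    # B^T 1: every column of this matrix equals the column sums of B.
    col_sums = [sum(row) for row in Bt]
    W = [[1.0] * T for _ in range(N)]  # weights initialized to unity
    for _ in range(n_iter):
        BW = matmul(B, W)
        ratio = [[M[f][t] / (BW[f][t] + eps) for t in range(T)] for f in range(F)]
        num = matmul(Bt, ratio)  # B^T [M / (BW)]
        W = [[W[i][t] * num[i][t] / (col_sums[i] + eps) for t in range(T)]
             for i in range(N)]
    return W


# Toy check: 3 frequency bins, 2 bases, 2 frames, built from known weights.
B_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_true = [[2.0, 0.5], [1.0, 3.0]]
M_toy = matmul(B_toy, W_true)
W_est = nmf_weights(M_toy, B_toy)
```

Each column of the returned matrix is the feature vector for one frame; with the settings reported above, `B` would have 1024 rows and 64 columns.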

3.2. Traditional Representations for Audio 

In addition to the features extracted from the NMF, we extract features using traditional means of analyzing the audio content, which describe the temporal and spectral sound structures effectively. We use the F-score to identify the more discriminative features for the dataset used. We expect that augmenting these with the NMF-based features will lead to increased efficiency in the detection of similar music documents in polyphonic settings.

In general, perception of structural boundaries in music is mostly influenced by variations in timbre, tonality and rhythm. Timbral features, such as spectral rolloff, which estimates the amount of high-frequency energy in the signal, and zero crossing rate are extracted. Key strength, a tonality feature which indicates the cross-correlation score for each tonality candidate, is another feature extracted. Other features extracted from the audio include Mel-Frequency Cepstral Coefficients, the relative Shannon entropy, indicating predominant peaks, kurtosis, indicating trends in the audio signal, standard deviation, skewness, and the amplitude envelope. We also use the novelty curve, obtained from convolution along the main diagonal of the similarity matrix using a Gaussian checkerboard. We use this feature set as the baseline for comparison with the enhanced system that also includes the NMF-based features.
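Two of the timbral features named above can be sketched per frame as follows. This is an illustrative approximation and not the toolbox implementation used in the experiments: the function names are hypothetical, the 0.85 rolloff fraction is an assumed setting, and the rolloff here accumulates spectral magnitude (some definitions accumulate energy instead).

```python
from typing import Sequence


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def spectral_rolloff(magnitudes: Sequence[float], sample_rate: float,
                     fraction: float = 0.85) -> float:
    """Frequency (Hz) below which `fraction` of the total magnitude lies.

    `magnitudes` holds equally spaced bins from 0 Hz up to the Nyquist frequency.
    """
    threshold = fraction * sum(magnitudes)
    running = 0.0
    for k, m in enumerate(magnitudes):
        running += m
        if running >= threshold:
            return k * (sample_rate / 2) / (len(magnitudes) - 1)
    return sample_rate / 2


zcr_alternating = zero_crossing_rate([1.0, -1.0, 1.0, -1.0])
rolloff_toy = spectral_rolloff([1.0, 1.0, 1.0, 1.0, 0.0], 16000, fraction=0.5)
```

Computing such values frame by frame yields one feature sequence per track and per feature class, which is the form the distance functions of Section 3.3 operate on.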

3.3. Distance Features for Pair Characterization 

In the previous subsections, we described the sets of features extracted for each music track. We use these features to compute distances, which we use to characterize pairs of tracks as plagiarized or not. Given the feature representations for the two tracks, we use the Dynamic Time Warping (DTW) algorithm to compute distances between the corresponding feature sequences. Dynamic time warping is a dynamic programming algorithm for efficiently computing the optimal alignment of non-linearly expanded or contracted sequences.

Based on an inspection of the dataset, we observed that the music in a pair of plagiarized pieces usually differed in rhythm, while retaining other properties that made them sound similar. We incorporate this intuition into our distance computation by using a modified version of the DTW algorithm for rhythm-based features, with local weights that prefer insertions or deletions to substitutions, to better account for the variation in rhythm. The modification is as follows:

$$D(i, j) = \min \begin{cases} D(i-1, j-1) + w_{sub} \cdot d(i, j) \\ D(i, j-1) + w_{del} \cdot d(i, j) \\ D(i-1, j) + w_{ins} \cdot d(i, j) \end{cases} \qquad (8)$$

where $D(i, j)$ is the accumulated alignment cost and $d(i, j)$ is the local distance between frame $i$ of one sequence and frame $j$ of the other.

We make $w_{sub}$ much larger than $w_{del}$ and $w_{ins}$. This follows from the fact that, even though DTW would effectively match two sequences having the same musical property with varying rhythm, the distance between the two sequences would still be large owing to the numerous insertions and deletions required. Thus, using a lower weight for insertion and deletion helps in bringing out this similarity.

However, this modification has a potential disadvantage in that the warping path might begin to prefer axis-parallel trajectories due to the significantly lower costs of insertions and deletions. This is overcome by constraining the slope of the warping path over short windows to be within a pre-specified range.
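The weighted recursion of Equation (8) can be sketched as below. This is a minimal illustration under assumptions: the weight values are placeholders (the text only requires the substitution weight to be much larger than the other two), the local distance is an absolute difference on scalar features, and the slope constraint is omitted because its exact form is not specified here.

```python
from typing import Callable, Sequence


def weighted_dtw(
    a: Sequence[float],
    b: Sequence[float],
    w_sub: float = 3.0,   # illustrative value, assumed
    w_del: float = 1.0,   # illustrative value, assumed
    w_ins: float = 1.0,   # illustrative value, assumed
    local: Callable[[float, float], float] = lambda x, y: abs(x - y),
) -> float:
    """Accumulated cost D(n, m) of the weighted DTW recursion in Eq. (8)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j] aligns the first i frames of a with the first j frames of b.
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local(a[i - 1], b[j - 1])
            D[i][j] = min(
                D[i - 1][j - 1] + w_sub * d,  # substitution (diagonal step)
                D[i][j - 1] + w_del * d,      # deletion
                D[i - 1][j] + w_ins * d,      # insertion
            )
    return D[n][m]


same_melody_slower = weighted_dtw([1.0, 2.0, 3.0], [1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
single_mismatch = weighted_dtw([0.0], [1.0])
```

In the toy call, a sequence and its time-stretched copy align at zero cost, which is the behaviour the lower insertion and deletion weights are meant to encourage for rhythm variations.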


4. Experimental Results 

We used the MIRtoolbox in MATLAB [10] to extract the traditional audio-based features described in Section 3.2. This was followed by extraction of the NMF features and the DTW-based distance computation to set up the pairwise characterization of tracks, as described in Section 3. These distance features and the true class labels for the training data were then used to train a random forest classifier with 150 trees.


4.1. Dataset 

We created a database of 2966 song pairs, comprising 966 plagiarized (positive data set) and 2000 non-plagiarized (negative data set) song pairs. The positive instances were obtained from music covers and plagiarism lawsuits [13], [14], including the Music Copyright Infringement Resource of the UCLA School of Law. The dataset comprises music from a wide range of genres and languages. All recordings were resampled to a uniform 16 kHz sampling rate with a frame length of 40 ms. The training and test data were separated randomly to provide a 9:1 train-test split.

Table 1: Overall classification accuracy comparison for the various systems on the entire test data

Baseline    NMF-only    Enhanced
45.1%       72.6%       78.4%

Figure 1: (L) Precision-recall curves for the various systems on the plagiarized instances in the test data; (R) corresponding Area Under the Curve (AUC) for the various systems (Baseline, NMF, Baseline+NMF)


4.2. Results 

We compare the performance of three systems on the test data. We use the traditional feature sets described in Section 3.2 as the baseline, and compare it with the performance obtained using the NMF features only, as well as with an enhanced feature set using the baseline features along with the NMF features. First, a comparison of overall classification accuracy on the test set is shown in Table 1. We find that the NMF-only system significantly outperforms the system using the baseline feature set, while the enhanced system significantly outperforms both of them.

Figure 1 compares the performance of the systems using precision-recall curves for the plagiarized instances in the test data. Since precision and recall are both metrics of accuracy, the higher the Area Under the Curve (AUC), the better the performance. We note that the NMF-only system outperforms the baseline, but the enhanced system significantly outperforms both the baseline and NMF-only systems.
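A precision-recall curve and its area can be computed from classifier scores roughly as follows. This is a sketch under assumptions, not the evaluation code behind the reported figures: the function names are hypothetical, tied scores are not grouped, at least one positive pair is assumed, and the area uses a simple trapezoidal rule starting at recall 0.

```python
from typing import List, Sequence, Tuple


def precision_recall_points(scores: Sequence[float],
                            labels: Sequence[int]) -> List[Tuple[float, float]]:
    """(recall, precision) after each pair, visiting pairs by decreasing score.

    Labels are +1 (plagiarized) or -1 (not plagiarized).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(1 for y in labels if y == 1)
    tp = fp = 0
    points = []
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under the curve, extended flat back to recall 0."""
    area = 0.0
    prev_r, prev_p = 0.0, points[0][1]
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area


# Toy example: both plagiarized pairs are scored above both clean pairs.
toy_points = precision_recall_points([0.9, 0.8, 0.7, 0.6], [1, 1, -1, -1])
toy_auc = area_under_curve(toy_points)
```

Each point on the curve corresponds to one setting of the threshold $p$ from Section 2, so the same sweep can also be used to pick an operating threshold.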

Figure 2 shows the successful and failed detection rates for plagiarized track pairs from different genres. Our method performs best in the country, jazz and rock categories, and worst in hip-hop. This may be because the presence of rap sequences makes the detection of these song pairs difficult.

An inspection of plagiarized sequences shows that plagiarized pieces often contain only a small sequence that is similar to the original, and our feature set does not do anything to explicitly address this. To account for such cases, one may consider deriving features from the alignment trajectories, to detect local occurrences of systematically varied rhythms.

Figure 2: Performance across genres (Baseline+NMF features)

While existing systems for the detection of near-duplicate music documents can be used for plagiarism detection, we observe that their performance worsens in polyphonic settings, failing in a number of cases where our method proves successful, e.g. the song pairs He's So Fine and My Sweet Lord, and Oye Mi Canto and Paginas De Mujer. Our method proves least effective in cases where the similar portions are in the background, e.g. a copied guitar riff in the musical pieces involved in the La Cienega Music Co. v. ZZ Top plagiarism suit.

5. Conclusion 

In this paper, we proposed a novel feature space derived from techniques commonly used in signal separation to account for polyphony in music recordings. We formulated the task of plagiarism detection in a supervised classification framework that uses distance features over pairs of music tracks to learn the model. Our approach resulted in a significant improvement in performance over the baseline. It is worth noting that our method does not dispense with the need for agencies that track potential copyright violations. Ideally, it should be used as a filtering mechanism that identifies extremely similar music samples.

This work used exemplar bases obtained from the data for NMF; alternative learning methods for the bases (such as training the basis set from the data, using examples of expected notes for initialization) and the effects of changing the basis set size are directions that we expect to explore in future work.

The NMF-based feature representation should be useful in other tasks that deal with polyphonic music as well. However, for music retrieval tasks such as query by humming or song-matching tasks, we need to be especially careful in the training/selection of basis vectors, because such retrieval tasks are often applied to databases that include user-generated content, typically of a worse quality than studio recordings due to background noise. For such tasks, we would need to have an additional step for noise removal/reduction or ensure that the basis vectors are not trained from studio-quality recordings only. We continue to actively explore these directions.